Special Data Types: Strings + Dates

Monday, April 29

Today we will…

  • Reminder: no class on Wednesday!
  • Comments from Week 4
  • New Material
    • String Variables
    • Regular Expressions
  • PA 5.1: Scrambled Message

Comments from Week 4

When describing data, include context as well as the data characteristics.

  • Where did the data come from? What years? Location? Source?
  • What is the data being used for?
  • What are the variables (in context) and observations (in context)?

Comments from Week 4

Read                 Think
Average              summarize(avg = mean())
Total                summarize(total = sum())
Which or For Each    group_by()
Minimum              slice_min()
Maximum              slice_max()

String Variables

What is a string?

A string is a bunch of characters.

There is a difference between…

…a string (many characters, one object)…

and

…a character vector (vector of strings).

. . .

my_string <- "Hi, my name is Bond!"
my_vector <- c("Hi", "my", "name", "is", "Bond")
my_string
[1] "Hi, my name is Bond!"
my_vector
[1] "Hi"   "my"   "name" "is"   "Bond"
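One way to see the difference is to compare the number of elements with the number of characters. A minimal sketch using the two objects above (str_length() comes from stringr; the result names are made up for illustration):

```r
library(stringr)

my_string <- "Hi, my name is Bond!"
my_vector <- c("Hi", "my", "name", "is", "Bond")

n_elements_string <- length(my_string)   # 1: a single string
n_elements_vector <- length(my_vector)   # 5: five separate strings

n_chars_string <- str_length(my_string)  # 20 characters in the one string
n_chars_vector <- str_length(my_vector)  # 2 2 4 2 4: one count per element
```

length() counts elements, while str_length() counts characters within each element.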

stringr

Common tasks

  • Identify strings containing a particular pattern.
  • Remove or replace a pattern.
  • Edit a string (e.g., make it lowercase).
knitr::include_graphics("https://github.com/rstudio/hex-stickers/blob/main/PNG/stringr.png?raw=true")

Note
  • The stringr package loads with tidyverse.
  • All functions are of the form str_xxx().

pattern =

The pattern argument appears in many stringr functions.

  • The pattern must be supplied inside quotes.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

str_detect(my_vector, pattern = "Bond")
str_locate(my_vector, pattern = "Bond")
str_match(my_vector, pattern = "Bond")
str_extract(my_vector, pattern = "Bond")
str_subset(my_vector, pattern = "Bond")


. . .

Let’s explore these functions!

str_detect()

Returns a logical vector indicating whether the pattern was found in each element of the supplied vector.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_detect(my_vector, pattern = "Bond")
[1] FALSE FALSE  TRUE  TRUE

. . .

  • Pairs well with filter().
  • Works with summarise() + sum or mean.

. . .

str_which() returns the indexes of the strings that contain a match.
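A rough sketch of how these pairings might look. The greetings tibble and its phrase column are hypothetical, built from the example vector only for illustration:

```r
library(dplyr)
library(stringr)

# Hypothetical one-column tibble built from the example vector
greetings <- tibble(phrase = c("Hello,", "my name is", "Bond", "James Bond"))

# filter() keeps the rows where str_detect() returns TRUE
bond_rows <- greetings |>
  filter(str_detect(phrase, pattern = "Bond"))

# TRUE counts as 1, so sum() counts matches and mean() gives the proportion
bond_summary <- greetings |>
  summarise(n_bond    = sum(str_detect(phrase, pattern = "Bond")),
            prop_bond = mean(str_detect(phrase, pattern = "Bond")))

# str_which() gives the positions of the matching elements
bond_index <- str_which(greetings$phrase, pattern = "Bond")  # 3 4
```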

str_match()

Returns a character matrix containing either NA or the matched pattern, depending on whether the pattern was found.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_match(my_vector, pattern = "Bond")
     [,1]  
[1,] NA    
[2,] NA    
[3,] "Bond"
[4,] "Bond"

. . .

The matrix will have more columns if you use regex groups.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_match(my_vector, pattern = "(.)o(.)")
     [,1]  [,2] [,3]
[1,] "lo," "l"  "," 
[2,] NA    NA   NA  
[3,] "Bon" "B"  "n" 
[4,] "Bon" "B"  "n" 

str_extract()

Returns a character vector with either NA or the matched pattern, depending on whether the pattern was found.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_extract(my_vector, pattern = "Bond")
[1] NA     NA     "Bond" "Bond"

. . .

Warning

str_extract() only returns the first pattern match.

Use str_extract_all() to return every pattern match.
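A small sketch of the difference, using a made-up vector in which one string contains the pattern twice:

```r
library(stringr)

x <- c("Bond, James Bond", "Hello,")

# One result per string: only the first match is kept
first_match <- str_extract(x, pattern = "Bond")      # "Bond" NA

# A list with one element per string, holding every match
all_matches <- str_extract_all(x, pattern = "Bond")
all_matches[[1]]  # "Bond" "Bond"
all_matches[[2]]  # character(0), because nothing matched
```

Because the number of matches can differ between strings, str_extract_all() returns a list rather than a vector.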

str_locate()

Returns an integer matrix with two columns – the starting and ending location of the pattern.

  • The values are NA if the pattern is not found.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_locate(my_vector, pattern = "Bond")
     start end
[1,]    NA  NA
[2,]    NA  NA
[3,]     1   4
[4,]     7  10

. . .

Related Function

str_sub() extracts values based on a starting and ending location.
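As an illustration, the locations found by str_locate() can be handed to str_sub(). This is only a sketch; the names loc, surname, and first_name are made up:

```r
library(stringr)

x <- "James Bond"

# Where does "Bond" start and end?
loc <- str_locate(x, pattern = "Bond")  # start = 7, end = 10

# Pull out the characters between those two positions
surname <- str_sub(x, start = loc[, "start"], end = loc[, "end"])  # "Bond"

# Positions can also be typed in directly
first_name <- str_sub(x, start = 1, end = 5)  # "James"
```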

str_subset()

Returns a character vector containing a subset of the original character vector consisting of the elements where the pattern was found.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_subset(my_vector, pattern = "Bond")
[1] "Bond"       "James Bond"

Try it out!

my_vector <- c("I scream,", "you scream", "we all",
               "scream","for","ice cream")

str_detect(my_vector, pattern = "cream")
str_locate(my_vector, pattern = "cream")
str_match(my_vector, pattern = "cream")
str_extract(my_vector, pattern = "cream")
str_subset(my_vector, pattern = "cream")
Note

For each of these functions, write down:

  • the object structure of the output.
  • the data type of the output.
  • a brief explanation of what they do.

Replace / Remove Patterns

Replace the first matched pattern in each string.

  • Pairs well with mutate().
str_replace(my_vector, pattern = "Bond", replacement = "Franco")
[1] "Hello,"       "my name is"   "Franco"       "James Franco"

str_replace_all() replaces all matched patterns in each string.

Remove the first matched pattern in each string.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_remove(my_vector, pattern = "Bond")
[1] "Hello,"     "my name is" ""           "James "    

This is a special case of str_replace(x, pattern, replacement = "").

str_remove_all() removes all matched patterns in each string.
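The difference between the "first match" and "_all" versions only shows up when a string contains the pattern more than once. A quick sketch with a made-up string:

```r
library(stringr)

x <- "banana"

first_replaced <- str_replace(x, pattern = "a", replacement = "o")      # "bonana"
all_replaced   <- str_replace_all(x, pattern = "a", replacement = "o")  # "bonono"

first_removed <- str_remove(x, pattern = "an")      # "bana"
all_removed   <- str_remove_all(x, pattern = "an")  # "ba"
```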

Edit Strings

Convert letters in a string to a specific capitalization format.

str_to_lower() converts all letters in a string to lowercase.


my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_to_lower(my_vector)
[1] "hello,"     "my name is" "bond"       "james bond"

str_to_upper() converts all letters in a string to uppercase.


str_to_upper(my_vector)
[1] "HELLO,"     "MY NAME IS" "BOND"       "JAMES BOND"

str_to_title() converts the first letter of each word to uppercase.


str_to_title(my_vector)
[1] "Hello,"     "My Name Is" "Bond"       "James Bond"

Combine Strings

Join multiple strings into a single string.

prompt <- "Hello, my name is"
first  <- "James"
last   <- "Bond"
str_c(prompt, last, ",", first, last, sep = " ")
[1] "Hello, my name is Bond , James Bond"
Note

Similar to paste() and paste0().

Combine a vector of strings into a single string.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_flatten(my_vector, collapse = " ")
[1] "Hello, my name is Bond James Bond"
Note

str_c(my_vector, collapse = " ") will do the same thing, but you should use str_flatten() instead!

Use variables in the environment to create a string based on {expressions}.

first <- "James"
last <- "Bond"
str_glue("My name is {last}, {first} {last}")
My name is Bond, James Bond
Tip

See the R package glue!

Tips for String Success

  • Refer to the stringr cheatsheet

  • Remember that str_xxx functions need the first argument to be a vector of strings, not a dataset!

    • You will use these functions inside dplyr verbs like filter() or mutate().
cereal |> 
  mutate(is_bran = str_detect(name, "Bran"), 
         .after = name)
name is_bran manuf type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
100% Bran TRUE N cold 70 4 1 130 10.0 5.0 6 280 25 3 1.00 0.33 68.40297
100% Natural Bran TRUE Q cold 120 3 5 15 2.0 8.0 8 135 0 3 1.00 1.00 33.98368
All-Bran TRUE K cold 70 4 1 260 9.0 7.0 5 320 25 3 1.00 0.33 59.42551
All-Bran with Extra Fiber TRUE K cold 50 4 0 140 14.0 8.0 0 330 25 3 1.00 0.50 93.70491
Almond Delight FALSE R cold 110 2 2 200 1.0 14.0 8 -1 25 3 1.00 0.75 34.38484
Apple Cinnamon Cheerios FALSE G cold 110 2 2 180 1.5 10.5 10 70 25 1 1.00 0.75 29.50954
Apple Jacks FALSE K cold 110 2 0 125 1.0 11.0 14 30 25 2 1.00 1.00 33.17409
Basic 4 FALSE G cold 130 3 2 210 2.0 18.0 8 100 25 3 1.33 0.75 37.03856
Bran Chex TRUE R cold 90 2 1 200 4.0 15.0 6 125 25 1 1.00 0.67 49.12025
Bran Flakes TRUE P cold 90 3 0 210 5.0 13.0 5 190 25 3 1.00 0.67 53.31381
Cap'n'Crunch FALSE Q cold 120 1 2 220 0.0 12.0 12 35 25 2 1.00 0.75 18.04285
Cheerios FALSE G cold 110 6 2 290 2.0 17.0 1 105 25 1 1.00 1.25 50.76500
Cinnamon Toast Crunch FALSE G cold 120 1 3 210 0.0 13.0 9 45 25 2 1.00 0.75 19.82357
Clusters FALSE G cold 110 3 2 140 2.0 13.0 7 105 25 3 1.00 0.50 40.40021
Cocoa Puffs FALSE G cold 110 1 1 180 0.0 12.0 13 55 25 2 1.00 1.00 22.73645
Corn Chex FALSE R cold 110 2 0 280 0.0 22.0 3 25 25 1 1.00 1.00 41.44502
Corn Flakes FALSE K cold 100 2 0 290 1.0 21.0 2 35 25 1 1.00 1.00 45.86332
Corn Pops FALSE K cold 110 1 0 90 1.0 13.0 12 20 25 2 1.00 1.00 35.78279
Count Chocula FALSE G cold 110 1 1 180 0.0 12.0 13 65 25 2 1.00 1.00 22.39651
Cracklin' Oat Bran TRUE K cold 110 3 3 140 4.0 10.0 7 160 25 3 1.00 0.50 40.44877
Cream of Wheat (Quick) FALSE N hot 100 3 0 80 1.0 21.0 0 -1 0 2 1.00 1.00 64.53382
Crispix FALSE K cold 110 2 0 220 1.0 21.0 3 30 25 3 1.00 1.00 46.89564
Crispy Wheat & Raisins FALSE G cold 100 2 1 140 2.0 11.0 10 120 25 3 1.00 0.75 36.17620
Double Chex FALSE R cold 100 2 0 190 1.0 18.0 5 80 25 3 1.00 0.75 44.33086
Froot Loops FALSE K cold 110 2 1 125 1.0 11.0 13 30 25 2 1.00 1.00 32.20758
Frosted Flakes FALSE K cold 110 1 0 200 1.0 14.0 11 25 25 1 1.00 0.75 31.43597
Frosted Mini-Wheats FALSE K cold 100 3 0 0 3.0 14.0 7 100 25 2 1.00 0.80 58.34514
Fruit & Fibre Dates; Walnuts; and Oats FALSE P cold 120 3 2 160 5.0 12.0 10 200 25 3 1.25 0.67 40.91705
Fruitful Bran TRUE K cold 120 3 0 240 5.0 14.0 12 190 25 3 1.33 0.67 41.01549
Fruity Pebbles FALSE P cold 110 1 1 135 0.0 13.0 12 25 25 2 1.00 0.75 28.02576
Golden Crisp FALSE P cold 100 2 0 45 0.0 11.0 15 40 25 1 1.00 0.88 35.25244
Golden Grahams FALSE G cold 110 1 1 280 0.0 15.0 9 45 25 2 1.00 0.75 23.80404
Grape Nuts Flakes FALSE P cold 100 3 1 140 3.0 15.0 5 85 25 3 1.00 0.88 52.07690
Grape-Nuts FALSE P cold 110 3 0 170 3.0 17.0 3 90 25 3 1.00 0.25 53.37101
Great Grains Pecan FALSE P cold 120 3 3 75 3.0 13.0 4 100 25 3 1.00 0.33 45.81172
Honey Graham Ohs FALSE Q cold 120 1 2 220 1.0 12.0 11 45 25 2 1.00 1.00 21.87129
Honey Nut Cheerios FALSE G cold 110 3 1 250 1.5 11.5 10 90 25 1 1.00 0.75 31.07222
Honey-comb FALSE P cold 110 1 0 180 0.0 14.0 11 35 25 1 1.00 1.33 28.74241
Just Right Crunchy Nuggets FALSE K cold 110 2 1 170 1.0 17.0 6 60 100 3 1.00 1.00 36.52368
Just Right Fruit & Nut FALSE K cold 140 3 1 170 2.0 20.0 9 95 100 3 1.30 0.75 36.47151
Kix FALSE G cold 110 2 1 260 0.0 21.0 3 40 25 2 1.00 1.50 39.24111
Life FALSE Q cold 100 4 2 150 2.0 12.0 6 95 25 2 1.00 0.67 45.32807
Lucky Charms FALSE G cold 110 2 1 180 0.0 12.0 12 55 25 2 1.00 1.00 26.73451
Maypo FALSE A hot 100 4 1 0 0.0 16.0 3 95 25 2 1.00 1.00 54.85092
Muesli Raisins; Dates; & Almonds FALSE R cold 150 4 3 95 3.0 16.0 11 170 25 3 1.00 1.00 37.13686
Muesli Raisins; Peaches; & Pecans FALSE R cold 150 4 3 150 3.0 16.0 11 170 25 3 1.00 1.00 34.13976
Mueslix Crispy Blend FALSE K cold 160 3 2 150 3.0 17.0 13 160 25 3 1.50 0.67 30.31335
Multi-Grain Cheerios FALSE G cold 100 2 1 220 2.0 15.0 6 90 25 1 1.00 1.00 40.10596
Nut&Honey Crunch FALSE K cold 120 2 1 190 0.0 15.0 9 40 25 2 1.00 0.67 29.92429
Nutri-Grain Almond-Raisin FALSE K cold 140 3 2 220 3.0 21.0 7 130 25 3 1.33 0.67 40.69232
Nutri-grain Wheat FALSE K cold 90 3 0 170 3.0 18.0 2 90 25 3 1.00 1.00 59.64284
Oatmeal Raisin Crisp FALSE G cold 130 3 2 170 1.5 13.5 10 120 25 3 1.25 0.50 30.45084
Post Nat. Raisin Bran TRUE P cold 120 3 1 200 6.0 11.0 14 260 25 3 1.33 0.67 37.84059
Product 19 FALSE K cold 100 3 0 320 1.0 20.0 3 45 100 3 1.00 1.00 41.50354
Puffed Rice FALSE Q cold 50 1 0 0 0.0 13.0 0 15 0 3 0.50 1.00 60.75611
Puffed Wheat FALSE Q cold 50 2 0 0 1.0 10.0 0 50 0 3 0.50 1.00 63.00565
Quaker Oat Squares FALSE Q cold 100 4 1 135 2.0 14.0 6 110 25 3 1.00 0.50 49.51187
Quaker Oatmeal FALSE Q hot 100 5 2 0 2.7 -1.0 -1 110 0 1 1.00 0.67 50.82839
Raisin Bran TRUE K cold 120 3 1 210 5.0 14.0 12 240 25 2 1.33 0.75 39.25920
Raisin Nut Bran TRUE G cold 100 3 2 140 2.5 10.5 8 140 25 3 1.00 0.50 39.70340
Raisin Squares FALSE K cold 90 2 0 0 2.0 15.0 6 110 25 3 1.00 0.50 55.33314
Rice Chex FALSE R cold 110 1 0 240 0.0 23.0 2 30 25 1 1.00 1.13 41.99893
Rice Krispies FALSE K cold 110 2 0 290 0.0 22.0 3 35 25 1 1.00 1.00 40.56016
Shredded Wheat FALSE N cold 80 2 0 0 3.0 16.0 0 95 0 1 0.83 1.00 68.23588
Shredded Wheat 'n'Bran TRUE N cold 90 3 0 0 4.0 19.0 0 140 0 1 1.00 0.67 74.47295
Shredded Wheat spoon size FALSE N cold 90 3 0 0 3.0 20.0 0 120 0 1 1.00 0.67 72.80179
Smacks FALSE K cold 110 2 1 70 1.0 9.0 15 40 25 2 1.00 0.75 31.23005
Special K FALSE K cold 110 6 0 230 1.0 16.0 3 55 25 1 1.00 1.00 53.13132
Strawberry Fruit Wheats FALSE N cold 90 2 0 15 3.0 15.0 5 90 25 2 1.00 1.00 59.36399
Total Corn Flakes FALSE G cold 110 2 1 200 0.0 21.0 3 35 100 3 1.00 1.00 38.83975
Total Raisin Bran TRUE G cold 140 3 1 190 4.0 15.0 14 230 100 3 1.50 1.00 28.59278
Total Whole Grain FALSE G cold 100 3 1 200 3.0 16.0 3 110 100 3 1.00 1.00 46.65884
Triples FALSE G cold 110 2 1 250 0.0 21.0 3 60 25 3 1.00 0.75 39.10617
Trix FALSE G cold 110 1 1 140 0.0 13.0 12 25 25 2 1.00 1.00 27.75330
Wheat Chex FALSE R cold 100 3 1 230 3.0 17.0 3 115 25 1 1.00 0.67 49.78744
Wheaties FALSE G cold 100 3 1 200 3.0 17.0 3 110 25 1 1.00 1.00 51.59219
Wheaties Honey Gold FALSE G cold 110 2 1 200 1.0 16.0 8 60 25 1 1.00 0.75 36.18756

Tips for String Success

The real power of these str_xxx functions comes when you specify the pattern using regular expressions!

knitr::include_graphics("images/regular_expressions.png")

regex

Regular Expressions

“Regexps are a very terse language that allow you to describe patterns in strings.”

R for Data Science

. . .

Use str_xxx functions + regular expressions!

str_detect(string  = my_string_vector,
           pattern = "p[ei]ck[a-z]")

. . .

Tip

You might encounter gsub(), grep(), etc. from Base R.

Regular Expressions

Regular expressions are tricky!

  • There are lots of new symbols to keep straight.
  • There are a lot of cases to think through.


This web app for testing R regular expressions might be handy!

Special Characters

There is a set of characters that have a specific meaning when using regex.

  • The stringr package does not read these as normal characters.
  • These characters are:

. ^ $ \ | * + ? { } [ ] ( )

Wild Card Character: .

. – matches any character.

x <- c("She", "sells", "seashells", "by", "the", "seashore!")
str_subset(x, pattern = ".ells")
[1] "sells"     "seashells"


This matches strings that contain any character followed by “ells”.

Anchor Characters: ^ $

^ – looks at the beginning of a string.

x <- c("She", "sells", "seashells", "by", "the", "seashore!")
str_subset(x, pattern = "^s")
[1] "sells"     "seashells" "seashore!"

This matches strings that start with “s”.

. . .

$ – looks at the end of a string.

str_subset(x, pattern = "s$")
[1] "sells"     "seashells"

This matches strings that end with “s”.

Quantifier Characters: ? + *

? – matches when the preceding character occurs 0 or 1 times in a row.

x <- c("shes", "shels", "shells", "shellls", "shelllls")
str_subset(x, pattern = "shel?s")
[1] "shes"  "shels"

. . .

+ – … occurs 1 or more times in a row.

str_subset(x, pattern = "shel+s")
[1] "shels"    "shells"   "shellls"  "shelllls"

. . .

* – … occurs 0 or more times in a row.

str_subset(x, pattern = "shel*s")
[1] "shes"     "shels"    "shells"   "shellls"  "shelllls"

Quantifier Characters: {}

{n} – matches when the preceding character occurs exactly n times in a row.

x <- c("shes", "shels", "shells", "shellls", "shelllls")
str_subset(x, pattern = "shel{2}s")
[1] "shells"

. . .

{n,} – … occurs at least n times in a row.

str_subset(x, pattern = "shel{2,}s")
[1] "shells"   "shellls"  "shelllls"

. . .

{n,m} – … occurs between n and m times in a row.

str_subset(x, pattern = "shel{1,3}s")
[1] "shels"   "shells"  "shellls"

Character Groups: ()

Groups are created with ( ).

  • We can specify “either” / “or” within a group using |.
x <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")
str_subset(x, pattern = "p(e|i)ck")
[1] "picked"  "peck"    "pickled"


This matches strings that contain either “peck” or “pick”.
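The parentheses matter. A short sketch of what changes without them, and of how a quantifier behaves after a group (the laughs vector is made up for illustration):

```r
library(stringr)

x <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")

# Inside a group, | chooses between "e" and "i" only
grouped <- str_subset(x, pattern = "p(e|i)ck")
# "picked" "peck" "pickled"

# Without the group, | splits the whole pattern: "pe" OR "ick"
ungrouped <- str_subset(x, pattern = "pe|ick")
# "Piper" "picked" "peck" "pickled" "peppers!"

# A quantifier placed after a group repeats the whole group
laughs <- c("h", "ha", "haha", "hah")
repeated <- str_subset(laughs, pattern = "^(ha)+$")
# "ha" "haha"
```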

Character Classes: []

Character classes let you specify multiple possible characters to match on.

x <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")
str_subset(x, pattern = "p[ei]ck")
[1] "picked"  "peck"    "pickled"

. . .

[^ ] – specifies characters not to match on (think except)

str_subset(x, pattern = "p[^i]ck")
[1] "peck"

. . .

[Pp] – capitalization matters!

str_subset(x, pattern = "^p")
[1] "picked"   "peck"     "pickled"  "peppers!"
str_subset(x, pattern = "^[Pp]")
[1] "Peter"    "Piper"    "picked"   "peck"     "pickled"  "peppers!"
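A character class is not the only way to handle capitalization. Two alternatives, sketched here, are stringr's regex() helper with ignore_case = TRUE and converting to lowercase before matching:

```r
library(stringr)

x <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")

# Character class: accept either "P" or "p" at the start
with_class <- str_subset(x, pattern = "^[Pp]")

# regex() modifier: ignore case for the whole pattern
with_flag <- str_subset(x, pattern = regex("^p", ignore_case = TRUE))

# Lowercase first, then match, but keep the original strings
with_lower <- x[str_detect(str_to_lower(x), pattern = "^p")]
```

All three approaches return the same six strings here.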

Character Classes: []

[ - ] – specifies a range of characters.

x <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")
str_subset(x, pattern = "p[ei]ck[a-z]")
[1] "picked"  "pickled"

. . .

  • [A-Z] matches any capital letter.
  • [a-z] matches any lowercase letter.
  • [A-Za-z] or [:alpha:] matches any letter
  • [0-9] or [:digit:] matches any digit
  • See the stringr cheatsheet for more shortcuts, like [:punct:]
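Ranges follow character-code order, so it is worth checking what a range really covers. In the sketch below (the symbols vector is made up), a range running from A to z also picks up the punctuation that sits between Z and a, such as ^ and _, while [A-Za-z] does not:

```r
library(stringr)

symbols <- c("A", "z", "_", "^", "5")

# One wide range: letters plus the punctuation between "Z" and "a"
wide_range <- str_detect(symbols, pattern = "[A-z]")
# TRUE TRUE TRUE TRUE FALSE

# Two ranges side by side: letters only
letters_only <- str_detect(symbols, pattern = "[A-Za-z]")
# TRUE TRUE FALSE FALSE FALSE
```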

Shortcuts

\\w – matches any “word” character (\\W matches any non-“word” character)

  • A “word” character is any letter, digit, or underscore.

\\d – matches any digit (\\D matches not digit)

\\s – matches any whitespace (\\S matches not whitespace)

  • Whitespace includes spaces, tabs, newlines, etc.

. . .

x <- "phone number: 1234567899"
str_extract(x, pattern = "\\d+")
[1] "1234567899"
str_extract_all(x, pattern = "\\S+")
[[1]]
[1] "phone"      "number:"    "1234567899"
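A few more uses of the shortcuts, sketched on the same string (the second input, with a double space and a tab, is made up):

```r
library(stringr)

x <- "phone number: 1234567899"

# \\w+ grabs runs of word characters, so the colon is dropped
words <- str_extract_all(x, pattern = "\\w+")[[1]]
# "phone" "number" "1234567899"

# \\D is "not a digit": removing every non-digit leaves only the number
digits_only <- str_remove_all(x, pattern = "\\D")
# "1234567899"

# \\s+ matches any run of whitespace, here collapsed to a single space
squished <- str_replace_all("a  b\tc", pattern = "\\s+", replacement = " ")
# "a b c"
```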

Try it out!

What regular expressions would match words that…

  • end with a vowel?
  • start with x, y, or z?
  • do not contain x, y, or z?
  • contain British spelling?
x <- c("zebra", "xray", "apple", "yellow",
       "color", "colour", "summarize", "summarise")
Code
str_subset(x, "[aeiouy]$")
str_subset(x, "^[xyz]")
str_subset(x, "^[^xyz]+$")
str_subset(x, "(our)|(i[sz]e)")

Escape: \\

To match a special character, you need to escape it.

x <- c("How", "much", "wood", "could", "a", "woodchuck", "chuck",
       "if", "a", "woodchuck", "could", "chuck","wood?")
str_subset(x, pattern = "?")
Error in stri_subset_regex(string, pattern, omit_na = TRUE, negate = negate, : Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`?`)

. . .

Use \\ to escape the ? – it is now read as a normal character.

str_subset(x, pattern = "\\?")
[1] "wood?"

. . .

Note

Alternatively, you could use []:

str_subset(x, pattern = "[?]")
[1] "wood?"
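The same idea applies to the other special characters. The sketch below uses made-up file names to show an unescaped period acting as a wild card, the escaped version, and stringr's fixed() helper, which treats the whole pattern as literal text:

```r
library(stringr)

files <- c("notes.txt", "notes_txt", "notestxt")

# Unescaped: . matches any single character
unescaped <- str_subset(files, pattern = "notes.txt")
# "notes.txt" "notes_txt"

# Escaped: only a literal period matches
escaped <- str_subset(files, pattern = "notes\\.txt")
# "notes.txt"

# fixed() turns off regex interpretation entirely
literal <- str_subset(files, pattern = fixed("notes.txt"))
# "notes.txt"

# Two backslashes in R code become one backslash in the actual pattern
writeLines("notes\\.txt")
```

R needs the first backslash to put a backslash into the string; the regex engine then uses that single backslash to escape the period.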

When in Doubt


knitr::include_graphics("images/backslashes.png")

Use the web app to test R regular expressions.

Tips for working with regex

  • Read the regular expressions out loud like a request.

. . .

  • Test out your expressions on small examples first.
str_view(c("shes", "shels", "shells", "shellls", "shelllls"), "l+")
[2] │ she<l>s
[3] │ she<ll>s
[4] │ she<lll>s
[5] │ she<llll>s

. . .


  • Be kind to yourself when working with regular expressions!

Strings in the tidyverse

stringr functions + dplyr verbs!

Country names with a (capital or lowercase) “Z”?

military |> 
  filter(str_detect(Country, "[Zz]")) |> 
  distinct(Country)
Country
Mozambique
Tanzania
Zambia
Zimbabwe
Belize
Brazil
Venezuela
Kazakhstan
Kyrgyzstan
Uzbekistan
New Zealand
Bosnia-Herzegovina
Czechia
Czechoslovakia
Azerbaijan
Switzerland

. . .

The proportion of country names with a compass direction?

military |> 
  distinct(Country) |> 
  summarize(prop = mean(str_detect(Country,
                                   "[Nn]orth|[Ss]outh|[Ee]ast|[Ww]est")))
# A tibble: 1 × 1
    prop
   <dbl>
1 0.0789

matches(pattern)

Select all variables with a name that matches the supplied pattern.

  • Pairs well with select(), rename_with(), and across().
military_clean <- military |> 
  mutate(across(`1988`:`2019`, 
                ~ na_if(.x, y = ". .")),
         across(`1988`:`2019`, 
                ~ na_if(.x, y = "xxx")))
military_clean <- military |> 
  mutate(across(matches("[0-9]{4}"), 
                ~ na_if(.x, y = ". .")),
         across(matches("[0-9]{4}"), 
                ~ na_if(.x, y = "xxx")))
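It is worth checking which columns a pattern actually selects before mutating them. A minimal sketch on a made-up stand-in for the military data (the toy tibble and its values are hypothetical): a class of nonzero digits, [1-9], skips any year that contains a 0, while [0-9] selects every four-digit column name.

```r
library(dplyr)

# Hypothetical stand-in for the military data: year columns stored as text
toy <- tibble(Country = c("A", "B"),
              `1989`  = c("1.2", ". ."),
              `1990`  = c("xxx", "3.4"),
              `2005`  = c("5.6", ". ."))

# Only names made of four NONZERO digits are selected
nonzero_digits <- toy |> select(matches("[1-9]{4}")) |> names()
# "1989"

# Any four digits in a row
any_digits <- toy |> select(matches("[0-9]{4}")) |> names()
# "1989" "1990" "2005"

# Replace both missing-value codes in every year column
toy_clean <- toy |>
  mutate(across(matches("[0-9]{4}"), ~ na_if(.x, ". .")),
         across(matches("[0-9]{4}"), ~ na_if(.x, "xxx")))
```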

Messy Covid Variants!

What is that variable?!

[{'variant': 'Other', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21APR-02 (Delta B.1.617.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21OCT-01 (Delta AY 4.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-22DEC-01 (Omicron CH.1.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 24.56}, {'variant': 'V-22JUL-01 (Omicron BA.2.75)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 8.93}, {'variant': 'V-22OCT-01 (Omicron BQ.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 49.57}, {'variant': 'VOC-21NOV-01 (Omicron BA.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.02}, {'variant': 'VOC-22APR-03 (Omicron BA.4)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.08}, {'variant': 'VOC-22APR-04 (Omicron BA.5)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.59}, {'variant': 'VOC-22JAN-01 (Omicron BA.2)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 1.41}, {'variant': 'unclassified_variant', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.26}]
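One possible way to start untangling a string like this is str_match_all() with a capture group. The sketch below runs on a shortened, hypothetical version of the string; the patterns are assumptions about its layout, not a full parser:

```r
library(stringr)

# Shortened, hypothetical version of the messy string
messy <- "[{'variant': 'Other', 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'newWeeklyPercentage': 0.0}]"

# Column 1 of the match matrix is the full match; column 2 is the group in ( )
variants <- str_match_all(messy, pattern = "'variant': '([^']+)'")[[1]][, 2]
# "Other" "V-20DEC-01 (Alpha)"

# Capture the digits and decimal point after the percentage label
percentages <- str_match_all(messy,
                             pattern = "'newWeeklyPercentage': ([0-9.]+)")[[1]][, 2] |>
  as.numeric()
# 4.59 0.00
```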

PA 5.1: Scrambled Message

In this activity, you will use regex to decode a message.

  • Remember: stringr functions go inside dplyr verbs like mutate() and filter() – use them like as.factor.

. . .

x <- c("She", "sells", "seashells", "by", "the", "seashore!")
  • Grab elements out of a vector with [].
x[c(1,4,5)]
[1] "She" "by"  "the"
  • To replace those elements, use <- to assign new values.
x[c(1,4,5)] <- ""

To do…

  • PA 5.1: Scrambled Message
    • Due Saturday, 5/4 at 11:59pm

Wednesday, May 1

Today we will…

Midterm Exam – Wednesday, 5/8

  • This is a three-part exam:
    1. You will first complete a General Questions section on paper and without your computer.
    2. After you turn that in, you will complete a Short Answer section with your computer.
       • You will have the one-hour-and-50-minute class period to complete the first two sections.
    3. The third section, Open-Ended Analysis, will be started in class and due 24 hours after the end of class.

Midterm Exam – Wednesday, 5/8

  • The exam is worth approximately 90 points.
    • Approx. 20 pts, 30 pts, and 40 pts for the three sections.
  • I will provide a .qmd template for the Short Answer.
  • You will create your own .qmd for the Open-Ended Analysis. You are encouraged to create this ahead of time.
Caution

While the coding tasks are open-resource, you will likely run out of time if you have to look everything up. Know what functions you might need and where to find documentation for implementing these functions!

Date + Time Variables

Why are dates and times tricky?

When parsing dates and times, we have to consider complicating factors like…

  • Daylight Saving Time.
    • One day a year is 23 hours; one day a year is 25 hours.
    • Some places use it, some don’t.
  • Leap years – most years have 365 days, some have 366.
  • Time zones.

lubridate

Common Tasks

  • Convert a date-like variable (“May 8, 1995”) to a date or date-time object.

  • Find the weekday, month, year, etc from a date-time object.

  • Convert between time zones.

knitr::include_graphics("https://github.com/rstudio/hex-stickers/blob/main/thumbs/lubridate.png?raw=true")

Note

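As of tidyverse 2.0.0, lubridate is a core tidyverse package and loads with library(tidyverse). With older versions, it installs with tidyverse but must be loaded separately: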

library(lubridate)

date-time Objects

There are multiple data types for dates and times.

  • A date:
    • date or Date
  • A date and a time (identifies a unique instant in time):
    • dttm
    • POSIXct – stores date-times as the number of seconds since January 1, 1970 (“Unix Epoch”)
    • POSIXlt – stores date-times as a list with elements for second, minute, hour, day, month, year, etc.
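These types can be inspected directly. A small sketch (the object names are made up) that checks the class of a date and a date-time, and looks at how each POSIX type stores its value:

```r
library(lubridate)

d  <- ymd("1995-05-08")
dt <- ymd_hms("1970-01-01 00:01:00", tz = "UTC")

class(d)   # "Date"
class(dt)  # "POSIXct" "POSIXt"

# A POSIXct value is a number underneath: seconds since the Unix Epoch
secs <- as.numeric(dt)  # 60

# A POSIXlt value is a list of components
lt <- as.POSIXlt(dt)
lt$min   # 1
lt$year  # 70, counted as years since 1900
```

lubridate's parsing functions return Date or POSIXct objects, which is also what tibbles display as date and dttm.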

Creating date-time Objects

Create a date from individual components:

make_date(year = 1995, month = 05, day = 08)
[1] "1995-05-08"

. . .

Create a date from a string:

mdy("May 8, 1995")
[1] "1995-05-08"
dmy("8-May-1995", tz = "America/Chicago")
[1] "1995-05-08 CDT"
dmy_hms("8-May-1995 9:32:12", tz = "America/Chicago")
[1] "1995-05-08 09:32:12 CDT"
as_datetime("95-05-08", format = "%y-%m-%d")
[1] "1995-05-08 UTC"
parse_datetime("5/8/1995", format = "%m/%d/%Y")
[1] "1995-05-08 UTC"

Creating date-time Objects

Common Mistake with Dates

What’s wrong here?

as_datetime(2023-02-6)
[1] "1970-01-01 00:33:35 UTC"


my_date <- 2023-02-6
my_date
[1] 2015


. . .

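Make sure you use quotes!

  • Without quotes, R treats 2023-02-6 as subtraction: 2023 − 2 − 6 = 2015.
  • as_datetime() then reads 2015 as seconds since January 1, 1970: 2,015 seconds = 33 minutes and 35 seconds.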
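The two versions side by side, as a quick sketch (the object names are made up):

```r
library(lubridate)

# Without quotes this is arithmetic: 2023 minus 2 minus 6
unquoted <- 2023-02-6
unquoted  # 2015

# A number is read as seconds after 1970-01-01 00:00:00 UTC
wrong <- as_datetime(unquoted)       # "1970-01-01 00:33:35 UTC"

# With quotes, the string is parsed as a date-time
right <- as_datetime("2023-02-06")   # "2023-02-06 UTC"
```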

Extracting date-time Components

bday <- ymd_hms("1993-11-20 9:32:12", tz = "America/New_York")
bday
[1] "1993-11-20 09:32:12 EST"


year(bday)
[1] 1993
month(bday)
[1] 11
day(bday)
[1] 20
wday(bday)
[1] 7
wday(bday, label = TRUE, abbr = FALSE)
[1] Saturday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Subtraction with date-time Objects

Doing subtraction gives you a difftime object.

  • difftime objects do not always have the same units – it depends on the scale of the objects you are working with.

How old am I?

today() - mdy(11201993)
Time difference of 11288 days

How long did it take me to finish a typing challenge?

begin <- mdy_hms("3/1/2023 13:04:34")
finish <- mdy_hms("3/1/2023 13:06:11")
finish - begin
Time difference of 1.616667 mins
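If the automatic choice of units is inconvenient, the units can be requested explicitly. A sketch using base R's difftime() and as.numeric() on the same two time stamps:

```r
library(lubridate)

begin  <- mdy_hms("3/1/2023 13:04:34")
finish <- mdy_hms("3/1/2023 13:06:11")

# Units are picked automatically (minutes here)
elapsed <- finish - begin

# Ask for a plain number in specific units
secs <- as.numeric(elapsed, units = "secs")  # 97

# Or set the units when computing the difference
explicit <- difftime(finish, begin, units = "secs")
```

Fixing the units is safer when the difference feeds into later calculations, since the automatic units can change from one pair of times to the next.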

Durations and Periods

Durations will always give the time span in an exact number of seconds.

as.duration(today() - mdy(11201993))
[1] "975283200s (~30.9 years)"
as.duration(finish - begin)
[1] "97s (~1.62 minutes)"

. . .

Periods will give the time span in more approximate, but human readable times.

as.period(today() - mdy(11201993))
[1] "11288d 0H 0M 0S"
as.period(finish - begin)
[1] "1M 37S"

Durations and Periods

We can also add time:

  • days(), years(), etc. will add a period of time.
  • ddays(), dyears(), etc. will add a duration of time.

. . .

Because durations use the exact number of seconds to represent days and years, you might get unexpected results:

When is my 99th birthday?

mdy(11201993) + years(99)
[1] "2092-11-20"
mdy(11201993) + dyears(99)
[1] "2092-11-19 18:00:00 UTC"
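Two more cases where periods and durations disagree, as a sketch. It assumes a lubridate version in which dyears(1) is 365.25 days (consistent with the output above) and relies on US Daylight Saving Time starting on March 10, 2024; the object names are made up:

```r
library(lubridate)

# 2020 is a leap year (366 days)
new_year <- ymd("2020-01-01")
period_year   <- new_year + years(1)   # "2021-01-01"
duration_year <- new_year + dyears(1)  # "2020-12-31 06:00:00 UTC"

# Clocks jump forward one hour on the night after this date
noon <- ymd_hms("2024-03-09 12:00:00", tz = "America/Los_Angeles")
period_day   <- noon + days(1)   # "2024-03-10 12:00:00 PDT"
duration_day <- noon + ddays(1)  # "2024-03-10 13:00:00 PDT"
```

A period moves the calendar or clock reading by the stated amount, while a duration moves forward by a fixed number of seconds.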

Time Zones

Time zones are complicated!

Specify time zones in the form:

  • {continent}/{city} – “America/New_York”, “Africa/Nairobi”
  • {ocean}/{city} – “Pacific/Auckland”

. . .

What time zone does R think I’m in?

Sys.timezone()
[1] "America/Los_Angeles"

Time Zones

You can change the time zone of a date in two ways:

x <- ymd_hms("2024-06-01 18:00:00", tz = "Europe/Copenhagen")

Keeps the instant in time the same, but changes the visual representation.

x |> 
  with_tz()
[1] "2024-06-01 09:00:00 PDT"
x |> 
  with_tz(tzone = "Asia/Kolkata")
[1] "2024-06-01 21:30:00 IST"

Changes the instant in time by forcing a time zone change.

x |> 
  force_tz()
[1] "2024-06-01 18:00:00 PDT"
x |> 
  force_tz(tzone = "Asia/Kolkata")
[1] "2024-06-01 18:00:00 IST"
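One way to confirm the difference is to compare each result with the original. A sketch with made-up object names:

```r
library(lubridate)

x <- ymd_hms("2024-06-01 18:00:00", tz = "Europe/Copenhagen")

same_instant <- with_tz(x, tzone = "Asia/Kolkata")   # 21:30 IST
new_instant  <- force_tz(x, tzone = "Asia/Kolkata")  # 18:00 IST

# with_tz() only changes how the instant is displayed
same_instant == x  # TRUE

# force_tz() keeps the clock reading, so the instant itself moves
new_instant == x   # FALSE
x - new_instant    # Time difference of 3.5 hours
```

Copenhagen is 3.5 hours behind Kolkata in June, which is exactly how far force_tz() shifts the instant.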

Common Mistake with Dates

When you read data in or create a new date-time object, the default time zone (if not specified) is UTC.

  • UTC (Coordinated Universal Time) is, for practical purposes, the same as GMT (Greenwich Mean Time).

Make sure you specify your desired time zone!

x <- mdy("11/20/1993")
tz(x)
[1] "UTC"
x <- mdy("11/20/1993", tz = "America/New_York")
tz(x)
[1] "America/New_York"

PA 5.2: Jewel Heist

Just down the road in Montecito, CA, several rare jewels went missing last fall. The jewels were stolen and replaced with fakes, but detectives have not been able to solve the case. They are now calling in a data scientist to help parse their clues.

Unfortunately, the date and time of the jewel heist is not known. You have been hired to crack the case. Use the clues below to discover the thief’s identity.

Submit the name of the thief to the Canvas Quiz.

Lab 5: Murder in SQL City

To do…

  • PA 5.2: Jewel Heist – due Saturday, 5/4 at 11:59pm.

  • Lab 5: Murder in SQL City – due Saturday, 5/4 at 11:59pm.

  • Read Chapter 6: Version Control

    • Check-ins 6.1 + 6.2 due Monday 5/6, at 10:00am